This is my solution to the Data Scientist Assignment for spitogatos.gr
The description and the solution of each part are presented below in R Markdown format, which renders as an HTML document. In my view this is an easy-to-read and convenient way of delivering this specific task.
You can find the clean R code, as well as the markdown file, in my GitHub profile here







Context


SpaN is a company that provides an online portal for real estate services. The main functionality of the portal is that property listings are published by real estate agents and visitors can search for properties based on a set of search criteria.

As a Data Scientist for SpaN, you closely cooperate with various departments to assist in decision making and strategy execution, based on actionable insights that you get out of data.

The following assignment consists of 3 parts. Each part has specific requirements and deliverables as described in the corresponding sections. The implementation needs to be either in Python or in R. After you complete the assignment, you need to provide a link to a public git repository that includes your code (in one file or several folders).




The Dataset


The given dataset is this month’s snapshot of all the listings of residential properties (houses) for sale listed on the portal of SpaN for 4 specific areas of Athens (geography_name). Each listing has a unique id, but it can be the case that the same actual property is uploaded by multiple real estate agents, so multiple different listings (with different ids) are created for the same property. Each agent is identified by a unique agent_id.

The rank in which listings are ordered within a specific area when a user makes a search depends on the type of the listing (ad_type) and their listing score (ranking_score). There are four different listing types: simple listings, which appear last; up listings, which rank above simple; premium listings, which rank above up; and star listings, which appear at the top of the list. Within each listing type group, properties are ranked based on the listing score.

The size of the property (sq_meters), its price and the area (geography_name) are the main search filters that users use in their initial search. The rest of the columns of the dataset are all further attributes of the properties listed and can be used as filters in a search. The year_of_construction column represents the year that the house was built; the value 2155 means that the house is under construction.




Assignment Part #1 (Marketing)


The marketing department wants to issue a press release regarding house prices in different areas of Athens. They ask if you could help them by providing some useful statistics that would show the price levels of houses for sale this month that real estate journalists would find useful. For this purpose, you will need to calculate tables that show some metrics, namely the mean, median and standard deviation of property prices per house type (subtype) and per area (geography_name). Before you calculate the final metrics, keep in mind that you should clean the dataset of any entries that you think should not be included in the analysis, because they would distort the picture of the price levels of each area.



Solution of Part 1

Data Cleaning

After examining the dataset for this part, I decided to remove some rows: values not applicable to this month and outlier prices.

The logic behind removing rows is as follows:

  • First of all, everything concerning under-construction properties was removed from the analysis. This task requires an analysis of this month's prices, so including under-construction (future) prices did not seem realistic. Then some listings priced below 1,000 euros were removed as unrealistic too.

  • Then outlier prices were removed. Prices outside the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] range were removed as outliers, but not over the entire dataset: prices were removed separately within each geography_name group.

Explanation:

If outlier prices were removed across the entire dataset, we would see that roughly 0% of beesy neighborhood listings would be removed and 1% for the gentrification area, but 8% and 10% for south beach and northern sub respectively.

This probably happens because these 2 areas are more expensive by default, so it stands to reason that they contribute the most outliers across the dataset. But since we need an analysis by geography, it seemed fairer to remove outliers within each geography_name.

If we apply the same removal method but split the data by geography_name, the percentages removed are: beesy neighborhood (8%), gentrification area (9.5%), northern sub (7.5%), south beach (8.7%), which looks a lot more balanced.
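The cleaning steps above can be sketched in base R as follows. The column names (price, geography_name, year_of_construction) follow the dataset description; the toy data frame is illustrative only, not the real dataset.

```r
# Toy stand-in for the listings dataset (illustrative values only)
listings <- data.frame(
  geography_name       = c(rep("beesy neighborhood", 6), rep("northern sub", 6)),
  price                = c(500, 90000, 110000, 95000, 105000, 900000,
                           400000, 650000, 700000, 620000, 680000, 5000000),
  year_of_construction = c(1990, 1995, 2000, 2005, 2010, 2155,
                           1980, 1990, 2000, 2010, 2015, 2005)
)

# 1. Drop under-construction listings (coded as year 2155) and
#    unrealistically cheap listings (below 1,000 euros)
clean <- subset(listings, year_of_construction != 2155 & price >= 1000)

# 2. Remove price outliers with the 1.5 * IQR rule, separately per area
iqr_filter <- function(df) {
  q   <- quantile(df$price, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  subset(df, price >= q[1] - 1.5 * iqr & price <= q[2] + 1.5 * iqr)
}
clean <- do.call(rbind, lapply(split(clean, clean$geography_name), iqr_filter))
```

Splitting before filtering is what makes the rule per-geography: each area's quartiles are computed from that area's prices only.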





Price Analysis

After excluding the rows that meet the aforementioned criteria (under construction, unrealistically low prices, outliers), we end up with around 88% of the starting raw dataset and can start exploring the prices in the requested groups:
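The requested metrics (mean, median and standard deviation per group) can be computed with a short base R sketch; here `clean` stands for the cleaned dataset described above, with illustrative values only.

```r
# Toy stand-in for the cleaned dataset
clean <- data.frame(
  geography_name = rep(c("south beach", "beesy neighborhood"), each = 3),
  price          = c(300000, 330000, 360000, 50000, 60000, 100000)
)

# Mean, median and standard deviation of price per area;
# the same call with subtype (or geography_name + subtype) on the
# right-hand side of the formula produces the other tables.
price_stats <- aggregate(
  price ~ geography_name, data = clean,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)
price_stats
```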

By Geography Name









By Subtype







Detailed Table by both Geography & Subtype




Comments for the Marketing Team :

  • A listing residing in the northern subs is on average priced 6.2 times higher than a listing in beesy neighborhood
  • Villas cost almost as much as an entire apartment complex (~900k euros)
  • As median prices go higher, their fluctuations grow as well: in geographies or subtypes with higher prices, the standard deviation is higher too
  • Most of the listings concern apartments in south beach, with a median price of 330k euros
  • The listing with the highest price is a maisonette in the northern subs valued at 1.65M euros, while the listing with the lowest price is a studio residing in beesy neighborhood valued at 8k euros




Assignment Part #2 (Sales)


The Sales Manager of SpaN, after conducting qualitative research by asking different agents in each area of Athens, wants to examine the possibility of offering special discounts for some listing types, based on the competitiveness of each area. To decide what type of discount should be given to agents in each area, she would need to see an analysis of the competitiveness of each area. A highly competitive area would mean that it would be hard for a simple listing to rank high in the search results of this area just by having a high ranking score. To help the sales manager decide the level of discount to be given to agents in each area, you would need to:



Solution of Part 2

The thought behind the creation of the area & ad_type opportunity scores

After examining the dataset for this part, I decided to create a competition metric called the opportunity score, based on 2 combined key factors. The opportunity score measures whether a simple listing would do well if it became an up, premium, or star listing.

This means that high opportunity score values indicate lower competitiveness in the upper ad_type tiers, so the opportunity to make simple listings paid is bigger, because according to their ranking score they would rank in a good position within the selected ad_type.

On the contrary, a low opportunity score means that the area is quite competitive: even if the simple ad became paid, the effect on its ranking would be minor, beyond of course ranking above all other simple listings.

Having said that, here are the 2 key factors used in the opportunity score creation:

  1. Measuring the volume of paid ad types in each area
  2. Measuring the percentage difference of the median ranking score between simple and paid ads in each geography name

For number 1 the formula is: 1 - (share of the ad_type in the area). For example, in beesy neighborhood there are 2182 listings, of which 34 (1.55%) are up listings, 30 (1.37%) star and 44 (2%) premium. The score for up listings would be 1 - 0.0155 = 0.9845, and so on.

For number 2 the formula is: ( median(ranking_score of simple listings) / median(ranking_score of up, premium or star listings) ) - 1. For example, in beesy neighborhood the median simple ranking score is 116.15 and the median ranking score of up listings is 113.1, so (116.15/113.1) - 1 = 0.027.

So the opportunity score of simple listings in beesy neighborhood for up listings would be 0.9845 / 0.027 ≈ 36.5.

Calculating the mean score over all ad_types within a geography_name gives us the area opportunity score as well.
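The two factors and their combination can be sketched in a few lines of R; the numbers reproduce the beesy neighborhood / up listings example above (small differences from the text are rounding).

```r
# Opportunity score for one area and one paid ad_type
n_total <- 2182     # all listings in the area
n_up    <- 34       # up listings in the area
median_simple <- 116.15   # median ranking_score of simple listings
median_up     <- 113.10   # median ranking_score of up listings

volume_factor <- 1 - n_up / n_total              # factor 1: paid-ad volume
score_gap     <- median_simple / median_up - 1   # factor 2: ranking-score gap
opportunity   <- volume_factor / score_gap
round(opportunity, 1)                            # about 36.5
```

A small score_gap (simple listings scoring close to the paid tier) inflates the score, which is exactly the "upgrading would rank well" situation described above.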





Results










Comments for the Sales Team :

  • South Beach & Beesy Neighborhood have the biggest opportunity scores, so it is suggested that agents try to make simple listings paid in these 2 areas
  • In South Beach it is recommended to try to make simple listings star listings, because they will not only rank above up & premium listings but many ads will also rank well within the star category
  • In Beesy Neighborhood it is recommended to try to make simple listings either up or star listings
  • The gentrification area seems to have the strongest competition (low area opportunity score, ~6). Even if we provide discounts here, listings will have difficulty ranking within the paid types as well
  • Simple listings in the northern subs, while having a lower area opportunity score than south beach or beesy neighborhood, have a ranking score pretty similar to star listings within the area. So it is recommended to offer discounts from simple to star type ads




Assignment Part #3 (Product)


The product team of SpaN wants to launch a new page on the portal that would help agents decide the correct price they should set for a property for sale in Athens. The agent would need to input certain attributes of a property and an algorithm would value the property, based on historical data. The team is building an MVP (minimum viable product) that would be launched in beta, in order to measure the willingness of agents to use the new page and get feedback on the accuracy of the predictions, based on the experience of agents in the market.

They ask if you could help them identify what are the most important attributes that an agent would have to input to get a valid prediction of a property’s price valuation and also build the model that would predict the value (price).

Using the data from the given dataset:

  • Identify the most important attributes in predicting the price of a property
  • Build a model that valuates each residential property



Solution of Part 3

Data specifics

First of all, since it is already cleaned of price outliers, I chose to use the per-geography cleaned dataset from Part 1 that was used for the marketing report.

The split between training and test set used is 80%-20% respectively.

Then we perform an NA analysis to see which variables we can use as predictors for the price, in order to build the model and identify the most important variables in it.
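The split and the NA analysis can be sketched in base R as follows; the data frame here is illustrative only, with balcony_area given artificial missing values.

```r
set.seed(42)   # reproducible split
# Toy stand-in for the cleaned dataset; 20 of the 100 balcony_area
# values are set to NA to mimic missingness
df <- data.frame(
  price        = runif(100, 5e4, 5e5),
  sq_meters    = runif(100, 40, 200),
  balcony_area = replace(runif(100, 0, 30), sample(100, 20), NA)
)

# 80% / 20% train-test split
idx   <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# Share of missing values per column, used to decide which
# predictors are worth keeping
na_share <- colMeans(is.na(df))
na_share
```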





As we can see, all variables after balcony_area have more than 10% of their values missing, so they are all excluded from the analysis in order to keep the model-building effort small.

Energy class is also removed, while renovation year is kept, accepting that if the listing was not renovated it gets the value 0 instead of NA.

We also exclude irrelevant features such as id, ranking_score and agent_id, keeping only the characteristics of each house.



Model Specifics

To perform the predictions, a simple linear regression model was selected for its easy interpretability and implementation.

The numeric data were all centered and scaled, while categorical data were transformed into dummy variables, meaning that a factor with 2 categories breaks up into 2 different binary columns, one for each category.
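These two preprocessing steps can be sketched with base R's scale() and model.matrix(); the data frame is illustrative only.

```r
# Toy data: one numeric predictor and one factor
df <- data.frame(
  sq_meters      = c(50, 80, 120, 60),
  geography_name = factor(c("south beach", "northern sub",
                            "south beach", "beesy neighborhood"))
)

# Center and scale the numeric column (mean 0, sd 1)
df$sq_meters <- as.numeric(scale(df$sq_meters))

# Expand the factor into one binary indicator column per level;
# "~ 0 + factor" keeps all levels instead of dropping a baseline
dummies <- model.matrix(~ 0 + geography_name, data = df)
cbind(df["sq_meters"], dummies)
```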

To select the optimal number of the remaining features, recursive feature elimination (RFE) was applied.

The RFE function suggests that we keep all 13 features in our model, but we can see (Table 5) that if we keep 11 we get almost the same RMSE and R-squared, and the sd of R-squared is the same. So I decided to keep the 11 most important features. (In practice these boil down to even fewer, 8 actual attributes, since after the factor-to-dummy transformation each level of geography_name is a stand-alone variable.)

To plot the importance of each feature I used the varImp function, which, for linear models, uses the absolute value of the t-statistic of each model parameter.
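Since for a linear model varImp (from the caret package) reduces to the absolute t-statistic of each coefficient, the same ranking can be reproduced with base R alone, as in this sketch on illustrative data.

```r
set.seed(1)
# Toy data where price depends strongly on sq_meters and weakly on rooms
n  <- 200
df <- data.frame(
  sq_meters = runif(n, 40, 200),
  rooms     = sample(1:5, n, replace = TRUE)
)
df$price <- 2000 * df$sq_meters + 5000 * df$rooms + rnorm(n, sd = 20000)

fit <- lm(price ~ sq_meters + rooms, data = df)

# Importance = |t-statistic| of each coefficient (intercept dropped),
# which is what varImp() reports for lm fits
t_values   <- summary(fit)$coefficients[-1, "t value"]
importance <- sort(abs(t_values), decreasing = TRUE)
importance
```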







Taking the above information into account, the final model includes (ordered by importance):

  • Square Meters
  • Geography Name
  • Year of Construction
  • Number of Bathrooms
  • Rooms
  • Number of Wcs
  • Kitchens
  • Balcony Area

With these variables the model has an R-squared of around 0.66 (the selected features explain 66% of the variance of price, which is considerable but still leaves a lot of room for improvement) and an RMSE (root mean squared error) of 4910.12 euros, while the average prediction error is around 30-35% of the actual price.
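For reference, the two quoted metrics are computed from held-out predictions as in this base R sketch; the vectors here are toy values, not the model's actual output.

```r
# Toy held-out actuals and model predictions
actual    <- c(100000, 250000, 400000, 150000)
predicted <- c(120000, 230000, 380000, 170000)

# Root mean squared error
rmse <- sqrt(mean((predicted - actual)^2))

# R-squared: 1 - residual sum of squares / total sum of squares
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, R2 = r2)
```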



Model Evaluation



Here is a visualization of how close each prediction is to the real price of the listing.

The predictions are shown in blue and the actual prices in green; try the “Compare Data on Hover” button for a better experience.





Comments for the Product team & the Agents

Both the product team and the agents should be aware that the current version of this model overestimates prices for low-value listings and underestimates prices for high-value listings; it is even possible for some low-value listings to get predictions below 0, which is definitely not optimal.

On the other hand, we created a model that runs on only 8 of the roughly 50 starting variables and still fits the actual prices reasonably well.

  • We should run this in beta and have the agents cross-validate these weaknesses in the real world
  • We expect satisfaction to be higher for agents who manage listings with average prices and lower for agents who manage corner-case listings (of either high or low value)





Future Steps

After the first beta iteration there is a lot of room for improvement:

We could

  • Try other data cleaning methods and play around more with corner cases
  • Split our model into the four main geography areas
  • Try imputing some of the missing values and include more features to see if they improve predictions
  • Create price categories and build classification models to predict price ranges instead of the point estimates of regression
  • Create clusters of listings with similar characteristics, classify new listings into these clusters, and then predict their prices with a model per cluster










📈 Thank you for your time 📈